-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retain zero-mutation samples #44
Conversation
Small code change as seen in diff for
This increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388. So we're gaining 9 samples for cognoma. You can see specific samples added in the diff for |
In 9f9f675 I added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One typo
The updates look good. The resulting binary matrix represents a list of high confidence mutation calls for all samples with matching mutation, gene expression, and clinical data.
I am curious, are there any samples in either mutation or gene expression data that are not in the clinical data? This seems unlikely, but we probably don't want to filter these samples either (we can infer its acronym.)
scripts/2.TCGA-process.py
Outdated
sample_ids is a pandas.Series | ||
""" | ||
sample_ids = pandas.Series(sample_ids) | ||
aconyms = sample_ids.map(sample_to_acronym) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aconyms
--> acronyms
I used the following code: samples_missing_clinical = sorted((gene_mutation_mat_df.index & expr_df.index).difference(clinmat_df.index))
for sample_id in samples_missing_clinical:
print(sample_id)
len(samples_missing_clinical) Turns out there are 389 missing samples: mutation or expression samples missing clinical data
Hmm this seems like an upstream issue and I'd prefer an upstream fix rather than hacking it ourselves. I propose we merge this PR and deal with this issue subsequently. |
Ah, this is interesting, and now that I think about it, totally expected. If I am remembering correctly (haven't confirmed) the clinical data stores mostly I am not sure where the upstream fix of this should live. Perhaps we should investigate sample specific vs. patient specific clinical data and merge mutation/gene exp calls on patient ID after the first merge on sample_id while retaining only patient specific identifiers (age, acronym, etc.) for these samples. |
I think it's simpler to not include metastatic tumors as there are not that many and they may break the independence between observation assumption of many classifiers (not sure if that really matters). |
Nearly all of the melanoma tumors (SKCM) are metastatic - these will be dropped if we go this route. |
FYI, I think this is not true and instead results from us filtering by sample types earlier in the notebook: cancer-data/scripts/2.TCGA-process.py Lines 165 to 171 in 93e4c53
Hence I gave my comment above a 👎 . |
Refs #43 (comment)
Note this may not retain all samples without mutations but with sequencing. However, it does retain all samples that we're aware of via Xena that have been sequenced.